Back

Genome Research

Cold Spring Harbor Laboratory

Preprints posted in the last 90 days, ranked by how well they match Genome Research's content profile, based on 409 papers previously published here. The average preprint has a 0.15% match score for this journal, so anything above that is already an above-average fit.

1
10-minimizers: a promising class of constant-space minimizers

Shur, A.; Tziony, I.; Orenstein, Y.

2026-03-18 bioinformatics 10.64898/2026.03.16.712052 medRxiv
Top 0.1%
25.8%
Show abstract

Minimizers are sampling schemes which are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of size{sigma} , a minimizer is defined by two positive integers k, w and a linear order{rho} on k-mers. A sequence is processed by a sliding window algorithm that chooses in each window of length w + k- 1 its minimal k-mer with respect to{rho} . A key characteristic of a minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite{sigma} -ary sequence. Minimizers of smaller density are preferred as they produce smaller samples, which lead to reduced runtime and memory usage in downstream applications. Recent studies developed methods to generate minimizers with optimal and near-optimal densities, but they require to explicitly store k-mer ranks in{Omega} (2k) space. While constant-space minimizers exist, and some of them are proven to be asymptotically optimal, no constant-space minimizers was proven to guarantee lower density compared to a random minimizer in the non-asymptotic regime, and many minimizer schemes suffer from long k-mer key-retrieval times due to complex computation. In this paper, we introduce 10-minimizers, which constitute a class of minimizers with promising properties. First, we prove that for every k > 1 and every w[≥] k- 2, a random 10-minimizer has, on expectation, lower density than a random minimizer. This is the first provable guarantee for a class of minimizers in the non-asymptotic regime. Second, we present spacers, which are particular 10-minimizers combining three desirable properties: they are constant-space, low-density, and have small k-mer key-retrieval time. In terms of density, spacers are competitive to the best known constant-space minimizers; in certain (k, w) regimes they achieve the lowest density among all known (not necessarily constant-space) minimizers. Notably, we are the first to benchmark constant-space minimizers in the time spent for k-mer key retrieval, which is the most fundamental operation in many minimizers-based methods. Our empirical results show that spacers can retrieve k-mer keys in competitive time (a few seconds per genome-size sequence, which is less than required by random minimizers), for all practical values of k and w. We expect 10-minimizers to improve minimizers-based methods, especially those using large window sizes. We also propose the k-mer key-retrieval benchmark as a standard objective for any new minimizer scheme.

2
A multi-flow approach for binning circular plasmids from short-reads assembly graphs

Epain, V.; Mane, A.; Della Vedova, G.; Bonizzoni, P.; Chauve, C.

2026-03-26 genomics 10.64898/2026.03.25.714305 medRxiv
Top 0.1%
19.4%
Show abstract

We address the problem of plasmid binning, that aims to group contigs - from a draft short-read assembly for a bacterial sample - into bins each expected to correspond to a plasmid present in the sequenced bacterial genome. We formulate the plasmid binning problem as a network multi-flow problem in the assembly graph and describe a Mixed-Integer Linear Program to solve it. We compare our new method, PlasBin-HMF, with state-of-the-art methods,MOB-recon, gplasCC, and PlasBin-flow, on a dataset of more than 500 bacterial samples, and show that PlasBin-HMF outperforms the other methods, by preserving the explainability.

3
MaxGeomHash: An Algorithm for Variable-Size Random Sampling of Distinct Elements

Hera, M. R.; Koslicki, D.; Martinez, C.

2026-02-25 bioinformatics 10.1101/2025.11.11.687920 medRxiv
Top 0.1%
18.6%
Show abstract

With the surge in sequencing data generated from an ever-expanding range of biological studies, designing scalable computational techniques has become essential. One effective strategy to enable large-scale computation is to split long DNA or protein sequences into k-mers, and summarize large k-mer sets into compact random samples (a.k.a. sketches). These random samples allow for rapid estimation of similarity metrics such as Jaccard or cosine, and thus facilitate scalable computations such as fast similarity search, classification, and clustering. Popular sketching tools in bioinformatics include Mash and sourmash. Mash uses the MinHash algorithm to generate fixed-size sketches; while sourmash employs FracMinHash, which produces sketches whose size scales linearly with the total number of k-mers. Here, we introduce a novel sketching algorithm, MO_SCPLOWAXC_SCPLOWGO_SCPLOWEOMC_SCPLOWHO_SCPLOWASHC_SCPLOW, which for a specified integer parameter b [≥] 1, will produce, without prior knowledge of n (the number of k-mers) a random sample of size b lg(n/b) + [O](b). Notably, this is the first permutation-invariant and parallelizable sketching algorithm to date that can produce sub-linear sketches, to the best of our knowledge. We also introduce a variant, -MO_SCPLOWAXC_SCPLOWGO_SCPLOWEOMC_SCPLOWHO_SCPLOWASHC_SCPLOW, that produces sketches of size{Theta} (n) for a given [isin] (0, 1). We study the algorithms properties, analyze generated sample sizes, verify theoretical results empirically, provide a fast implementation, and investigate similarity estimate quality. With intermediate-sized samples between constant (MinHash) and linear (FracMinHash), MO_SCPLOWAXC_SCPLOWGO_SCPLOWEOMC_SCPLOWHO_SCPLOWASHC_SCPLOW balances efficiency (smaller samples need less storage and processing) with accuracy (larger samples yield better estimates). On genomic datasets, we demonstrate that MO_SCPLOWAXC_SCPLOWGO_SCPLOWEOMC_SCPLOWHO_SCPLOWASHC_SCPLOW sketches can be used to compute a similarity tree (proxy for a phylogenetic tree) more accurately than MinHash, and more efficiently than FracMinHash. Our C++ implementation is available at: github.com/mahmudhera/kmer-sketch. Code to reproduce the analyses and experiments is at: github.com/KoslickiLab/MaxGeomHash.

4
Generating minimum-density minimizers

Shur, A.; Tziony, I.; Orenstein, Y.

2026-01-28 bioinformatics 10.64898/2026.01.25.701585 medRxiv
Top 0.1%
18.4%
Show abstract

Minimizers are sampling schemes which are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of size{sigma} , a minimizer is defined by two positive integers k, w and a linear order{rho} on k-mers. A sequence is processed by a sliding window algorithm that chooses in each window of length w + k - 1 its minimal k-mer with respect to{rho} . A key characteristic of a minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite{sigma} -ary sequence. Minimizers of smaller density are preferred as they produce smaller samples, which lead to reduced runtime and memory usage in downstream applications. While the hardness of finding a minimizer of minimum density for given input parameters ({sigma}, k, w) is unknown, it has a huge search space of ({sigma}k)! and there is no known algorithm apart from a trivial brute-force search. In this paper, we tackle the minimum density problem for minimizers. We first formulate this problem as an ILP of size{Theta} (w{sigma}w+k), which has worst-case solution time that is doubly-exponential in (k + w) under standard complexity assumptions. Our experiments show that an ILP solver terminates with an optimal solution only for very small k and w. We then present our main method, called OptMini, which computes an optimal minimizer in [Formula] time and thus is capable of processing large w values. In experiments, OptMini works much faster than the runtime predicts due to several additional tricks shrinking the search space without harming optimality. We use OptMini to compute minimum-density minimizers for ({sigma}, k) [isin] {(2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (4, 2)} and w [isin] [2, 3{sigma}k], with the exception of certain w-ranges for k = 6 and the single case of k = 5, w = 2. Finally, we derive conclusions and insights regarding the density values as a function of w, patterns in optimal minimizer orders, and the relation between minimum-size universal hitting sets and minimum-density minimizers.

5
PC-index, a composite metric for gene expression in single-cell RNA-seq

Zhang, R.

2026-01-23 genomics 10.64898/2026.01.21.700965 medRxiv
Top 0.1%
18.3%
Show abstract

Single-cell RNA sequencing enables analysis of gene expression heterogeneity, but summarizing gene expression across large cell populations remains challenging. Gene expression is commonly described using average expression (counts per 10,000 transcripts, CP10K) and cellular prevalence (the fraction of cells expressing a gene), which can be difficult to interpret jointly. Here, we introduce the PC-index, a single, intuitive metric that integrates expression magnitude and prevalence. The PC-index is defined as the largest value X such that at least X% of cells express a gene at [≥]X/10 CP10K. Using human adipocytes from GTEx single-nucleus RNA-seq as an illustrative example, we show that adipocyte genes exhibit distinct PC-index profiles. For example, adiponectin has a PC-index of 21, indicating that at least 21% of adipocytes express it at [≥]2.1 CP10K. Conceptually, the PC-index reflects both percentage and CP10K, yielding a stable and interpretable single-number summary of gene expression behavior in single-cell data.

6
New Space-Time Tradeoffs for Subset Rank and k-mer Lookup

Diseth, A. C.; Puglisi, S. J.

2026-03-18 bioinformatics 10.64898/2026.03.16.712042 medRxiv
Top 0.1%
17.7%
Show abstract

Given a sequence S of subsets of symbols drawn from an alphabet of size{sigma} , a subset rank query srank(i, c) asks for the number of subsets before the ith subset that contain the symbol c. It was recently shown (Alanko et al., Proc. SIAM ACDA, 2023) that subset rank queries on the spectral Burrows-Wheeler lead to efficient k-mer lookup queries, an essential and widespread task in genomic sequence analysis. In this paper we design faster subset rank data structures that use small space--less than 3 bits per k-mer. Our experiments show that this translates to new Pareto optimal SBWT-based k-mer lookup structures at the low-memory end of the space-time spectrum.

7
scRGCL: Neighbor-Aware Graph Contrastive Learning for Robust Single-Cell Clustering

Fan, J.; Liu, F.; Lai, X.

2026-03-18 bioinformatics 10.64898/2026.03.16.712039 medRxiv
Top 0.1%
17.1%
Show abstract

Accurate cell type identification is a fundamental step in single-cell RNA sequencing (scRNA-seq) data analysis, providing critical insights into cellular heterogeneity at high resolution. However, the high dimensionality, zero-inflated, and long-tailed distribution of scRNA-seq data pose significant computational challenges for conventional clustering approaches. Although recent deep learning-based methods utilize contrastive learning to joint-learn representations and clustering assignments, they often overlook cluster-level information, leading to suboptimal feature extraction for downstream tasks. To address these limitations, we propose scRGCL, a single-cell clustering method that learns a regularized representation guided by contrastive learning. Specifically, scRGCL captures the cell-type-associated expression structure by clustering similar cells together while ensuring consistency. For each sample, the model performs negative sampling by selecting cells from distinct clusters, thereby ensuring semantic dissimilarity between the target cell and its negative pairs. Moreover, scRGCL introduces a neighbor-aware re-weighting strategy that increases the contribution of samples from clusters closely related to the target. This mechanism prevents cells from the same category from being mistakenly pushed apart, effectively preserving intra-cluster compactness. Extensive experiments on fourteen public datasets demonstrate that scRGCL consistently outperforms state-of-the-art methods, as evidenced by significant improvements in normalized mutual information (NMI) and adjusted rand index (ARI). Moreover, ablation studies confirm that the integration of cluster-aware negative sampling and the neighbor-aware re-weighting module is essential for achieving high-fidelity clustering. By harmonizing cell-level contrast with cluster-level guidance, scRGCL provides a robust and scalable framework that advances the precision of automated cell-type discovery in increasingly complex single-cell landscapes. Key MessagesO_LIscRGCL uses contrastive learning on a regularized representation for single-cell clustering. C_LIO_LIscRGCL outperforms four state-of-the-art methods on 15 datasets. C_LIO_LIscRGCLs cluster-aware negative sampling and the neighbor-aware re-weighting modules are essential for high-fidelity single cell clustering. C_LI

8
Hierarchical genomic feature annotation with variable-length queries

Alanko, J. N.; Ranallo-Benavidez, T. R.; Barthel, F. P.; Puglisi, S. J.; Marchet, C.

2026-03-18 bioinformatics 10.64898/2026.03.15.711907 medRxiv
Top 0.1%
17.1%
Show abstract

K-mer-based methods are widely used for sequence classification in metagenomics, pangenomics, and RNA-seq analysis, but existing tools face important limitations: they typically require a fixed k-mer length chosen at index construction time, handle multi-matching k-mers (whose origin in the indexed data is ambiguous) in ad-hoc ways, and some resort to lossy approximations, complicating interpretation. We present HKS, a data structure for exact hierarchical variable-length k-mer annotation. Building on the Spectral Burrows- Wheeler Transform (SBWT), a single HKS index is constructed for a specified maximum query length s, and supports queries at any length k [≤] s. HKS associates each k-mer with exactly one label from a user-defined category hierarchy, where multi-matching k-mers are resolved to their most specific common node in the hierarchy. We formalize a feature assignment framework that partitions indexed k-mers into disjoint sets according to a user-defined category hierarchy. To recover specificity lost to multi-matching and novel k-mers, we introduce a hierarchy-aware smoothing algorithm that makes use of flanking sequence context. We validate the approach by assigning each query k-mer to a specific chromosome across human genome assemblies, including the T2T-CHM13v2.0 reference as a positive control and two diploid genomes of different ancestries (HG002, NA19185). Smoothing increases overall concordance from [~]81% to [~]97%, with residual errors attributable to known biological phenomena including acrocentric short-arm recombination and subtelomeric duplications. In performance benchmarks against Kraken2, HKS provides comparable query throughput while providing exact, lossless annotation across all k-mer lengths simultaneously from a single index. A prototype implementation is available at https://github.com/jnalanko/HKS.

9
Significantly Improved Mouse and Rat Genome Annotation Using Sequence Read Archive RNA-seq Data

Meng, F.; Turner, D. L.; Hagenauer, M. H.; Watson, S.; Akil, H.

2026-03-09 genomics 10.64898/2026.03.06.709975 medRxiv
Top 0.1%
14.9%
Show abstract

To detect currently unannotated genes with low expression levels with high sensitivity and accuracy, we developed a new exon->gene->transcript annotation pipeline that can identify previously undetected multi-exon transcripts using large volumes of RNA-Seq data. Our pipeline incorporates three new algorithms: 1) model-based spliced exon detection, 2) exon-to-gene assignment across multiple tissue/datasets through exon community discovery, and 3) ranking top transcripts by a stepwise minimum flow procedure. The design of our pipeline allowed us to leverage hundreds of Tbases of public RNA-seq data as input to improve mouse and rat genome annotation. Using this data, our pipeline identified close to 15K and 21K unannotated genes in GENCODE M37 and ENSEMBL 114 for mouse and rat, respectively. Each species also gained over 200K predicted transcripts containing at least one new exon, although most were transcripts from GENCODE/ENSEMBL annotated genes with newly assigned exons. To make our genome annotation available for common use, we have packaged this new annotation in standard file formats for the analysis of bulk and single cell RNA-seq data (GTF, 10X genome files). We have also provided two use examples which demonstrate the utility of our newly annotated genes in functional analyses, showing that their expression can be differentially regulated in relationship to cell type and selective breeding. Due to the efficiency provided by our pipeline, we expect that as new RNA-seq data become available in the coming years it will significantly benefit rat gene/transcript annotation, eventually enabling us to approach the target of complete gene and transcript annotation.

10
Cell type-specific gene regulatory network inference from single cell transcriptomics with ctOTVelo

Chang, S.; Zhao, W.; Ma, Y.; Sandstede, B.; Singh, R.

2026-03-14 genomics 10.64898/2026.03.11.711174 medRxiv
Top 0.1%
14.7%
Show abstract

Inferring gene regulatory networks (GRNs) from gene expression is a crucial task for understanding functional relationships. Gene expression data (transcriptomics) provide a snapshot of gene activity, encoding information about gene regulatory relationships. However, gene regulation is a dynamic process, modulating across time and with different cell types. Temporal GRN inference methods aim to capture these dynamics by utilizing time-stamped transcriptomics, gene expression data of similar samples captured across discrete timepoints, or pseudotime transcriptomics, computationally ordering cells based on an inferred trajectory. These methods can estimate constant or temporal gene regulatory relationships, but may not capture finer, cell type specific relationships. We propose ctOTVelo, an extension to our previous work to account for cell type specificity during GRN inference. ctOTVelo incorporates cell type labels or proportions when inferring the GRN from single cell transcriptomics data. Our methods achieve state-of-the-art performance in GRN prediction in time-stamped and pseudotime-stamped transcriptomics. Furthermore, ctOTVelo is able to generate cell type specific GRNs, allowing cell type resolution analysis of gene regulatory relationships.

11
DemuxHMM: Large-Scale Single-Cell Embryo Profiling via Recombination Barcoding

Afanassiev, A. I.; Wei, K.; Yachie, N.; Sugioka, K.; Schiebinger, G.

2026-02-24 bioinformatics 10.64898/2026.02.23.703392 medRxiv
Top 0.1%
14.2%
Show abstract

High-resolution developmental time-courses with single-cell RNA sequencing (scRNA-seq) increasingly target trajectory inference and other analyses in the study of development and disease [1-6]. These datasets are often generated by pooling individuals and inferring cell-to-individual mappings after sequencing, in a process called demultiplexing. Existing demultiplexing methods are limited in the number of timepoints they can support, due to either the need for individual-by-individual processing or reduced accuracy at large numbers of individuals. To address these limitations, we introduce a combined experimental and computational framework for creating large-scale, individual-resolved datasets. Our framework couples a simple breeding scheme that creates contiguous SNP patterns (recombination barcodes) with a recombination-aware demultiplexing method, DemuxHMM, that explicitly models this structure with a Hidden Markov Model (HMM). We demonstrate substantial performance and scalability gains from this combined approach on simulated data, highlighting its potential to enable the creation of large-scale single-cell time series.

12
A run-length-compressed skiplist data structure for dynamic GBWTs supports time and space efficient pangenome operations over syncmers

Durbin, R.

2026-03-29 bioinformatics 10.64898/2026.03.26.714584 medRxiv
Top 0.1%
14.0%
Show abstract

Skiplists (Pugh, 1990) are probabilistic data structures over ordered lists supporting [O] (log N) insertion and search, which share many properties with balanced binary trees. Previously we introduced the graph Burrows-Wheeler transform (GBWT) to support efficient search over pangenome path sets, but current implementations are static and cumbersome to build and use. Here we introduce two doubly-linked skiplist variants over run-length-compressed BWTs that support [O] (log N) rank, access and insert operations. We use these to store and search over paths through a syncmer graph built from Edgars closed syncmers, equivalent to a sparse de Bruijn graph. Code is available in rskip.[ch] within the syng package at github.com/richarddurbin/syng. This builds a 5.8 GB lossless GBWT representation of 92 full human genomes (280Gbp including all centromeres and other repeats) single-threaded in 52 minutes, on top of a 4GB 63bp syncmer set built in 37 minutes. Arbitrarily long maximal exact matches (MEMs) can then be found as seeds for sequence matches to the graph at a search rate of approximately 1Gbp per 10 seconds per thread.

13
Privacy-Preserving Pangenome Graphs

Blindenbach, J.; Soni, S.; Gursoy, G.

2026-02-18 bioinformatics 10.64898/2026.02.16.706152 medRxiv
Top 0.1%
12.8%
Show abstract

The human pangenome reference, often represented as a graph, promises to capture genetic diversity across populations, but open release of individual haplotypes raises significant privacy concerns, including risks of re-identification and inference of sensitive traits. To address these challenges, we introduce PanMixer, a framework for privacy-preserving pangenome graph releases that selectively obfuscates an individuals haplotypes while retaining the utility of the reference graph. PanMixer formulates the privacy-utility trade-off as a knapsack problem, where privacy risk is quantified using information theory and utility is measured using graph properties. Using the recently released draft human pangenome containing 47 individuals, we show that PanMixer robustly reduces re-identification risk under linkage attacks and genome reconstruction attempts. We also show that PanMixer preserves the accuracy of key downstream applications, including allele frequency estimation, linkage disequilibrium analysis, and read mapping. By addressing privacy concerns, PanMixer enables the inclusion of individuals, particularly those from underrepresented populations, who might otherwise be reluctant to contribute but seek representation in future genomic studies. Our results provide both a practical tool and a generalizable framework for balancing privacy and utility in future large-scale pangenome references.

14
Pareto optimization of masked superstrings improves compression of pan-genome k-mer sets

Plachy, J.; Sladky, O.; Brinda, K.; Vesely, P.

2026-03-20 bioinformatics 10.64898/2026.03.18.712440 medRxiv
Top 0.1%
12.5%
Show abstract

The growing interest in k-mer-based methods across bioinformatics calls for compact k-mer set representations that can be optimized for specific downstream applications. Recently, masked superstrings have provided such flexibility by moving beyond de Bruijn graph paths to general k-mer superstrings equipped with a binary mask, thereby subsuming Spectrum--Preserving String Sets and achieving compactness on arbitrary k-mer sets. However, existing methods optimize superstring length and mask properties in two separate steps, possibly missing solutions where a small increase in superstring length yields a substantial reduction in mask complexity. Here, we introduce the first method for Pareto optimization of k-mer superstrings and masks, and apply it to the problem of compressing pan-genome k-mer sets. We model the compressibility of masked superstrings using an objective that combines superstring length and the number of runs in the mask. We prove that the resulting optimization problem is NP-hard and develop a heuristic based on iterative deepening search in the Aho-Corasick automaton. Using microbial pan-genome datasets, we characterize the Pareto front in the superstring-length/mask-run space and show that the front contains points that Pareto-dominate simplitigs and matchtigs, while nearly encompassing the previously studied greedy masked superstrings. Finally, we demonstrate that Pareto-optimized masked superstrings improve pan-genome k-mer set compressibility by 12-19% when combined with neural-network compressors.

15
Sample barcoding-associated technical variation in probe-based single-cell RNA sequencing

Weir, J. A.; Krebs, Y.; Chen, F.

2026-04-08 genomics 10.64898/2026.04.06.716804 medRxiv
Top 0.1%
12.4%
Show abstract

Probe-based single cell RNA sequencing approaches are increasingly becoming a technology of choice for profiling gene expression at scale and in archival tissues. The 10x Genomics Flex v1 assay enables cost-effective and high-sensitivity single-cell RNA sequencing by splitting samples across up to 16 uniquely barcoded probe sets before pooling and loading onto a single lane of a microfluidic chip. A natural consequence of this design is to leverage probe set barcoding as a sample barcoding strategy for case-control experiments. However, we observed that Flex v1 probe set barcode identity drives substantial technical variation between probe set barcodes, an effect that is reproducible across lanes and independent datasets. When Flex v1 probe set barcodes are confounded with biological sample identity, a concerning number of differentially expressed genes at standard thresholds are false positives. The Flex v2 assay, which decouples sample barcoding from probe set hybridization, significantly reduces this artifact. As the field continues to expand adoption of probe-based assays, our findings introduce probe set barcoding as an underappreciated source of technical variation in single-cell assays and emphasize the importance of experimental design when using probe-based sequencing technologies.

16
GraphHDBSCAN*: Graph-based Hierarchical Clustering on High Dimensional Single-cell RNA Sequencing Data

Ghoreishi, S. A.; Szmigiel, A. W.; Nagai, J. S.; Gesteira Costa Filho, I.; Zimek, A.; Campello, R. J. G. B.

2026-03-26 bioinformatics 10.64898/2026.03.24.713924 medRxiv
Top 0.1%
12.3%
Show abstract

Single-cell RNA sequencing (scRNA-seq) is widely used to resolve cellular heterogeneity across thousands to millions of cells. A major challenge is to identify biologically meaningful cell populations while preserving their hierarchical organization, because broad cell types frequently split into more specialized subtypes. However, state-of-the-art approaches mostly focus on flat partitions and ignore the hierarchical structure of single-cell data. Here we introduce GraphHDBSCAN*, a graph-based, hyperparameter-free extension of HDBSCAN* that performs hierarchical density-based clustering on a graph representation of the data, enabling robust recovery of both single-level and hierarchical relationships in high-dimensional and sparse datasets. We evaluate GraphHDBSCAN* across multiple scRNA-seq datasets and show that it recovers biologically meaningful hierarchies that reveal fine-grained structure in complex data, including monocyte subpopulations. In addition, the method yields high-quality flat partitions that outperform widely used community-detection methods.

17
Construction of distinct k-mer color sets via set fingerprinting

Alanko, J. N.; Puglisi, S. J.

2026-02-18 bioinformatics 10.64898/2026.02.16.706153 medRxiv
Top 0.1%
12.2%
Show abstract

The colored de Bruijn graph model is the currently dominant paradigm for indexing large microbial reference genome datasets. In this model, each reference genome is assigned a unique color, typically an integer id, and each k-mer is associated with a color set, which is the set of colors of the reference genomes that contain that k-mer. This data structure efficiently supports a variety of pseudoalignment algorithms, which aim to determine the set of genomes most compatible with a query sequence. In most applications, many distinct k-mers are associated with the same color set. In current indexing algorithms, color sets are typically deduplicated and compressed only at the end of index construction. As a result, the peak memory usage can greatly exceed the size of the final data structure, making index construction a bottleneck in analysis pipelines. In this work, we present a Monte Carlo algorithm that constructs the set of distinct color sets for the k-mers directly in any individually compressed form. The method performs on-the-fly deduplication via incremental fingerprinting. We provide a strong bound on the error probability of the algorithm, even if the input is chosen adversarially, assuming that a source of random bits is available at run time. We show that given an SBWT index of 65,536 S. enterica genomes, we can enumerate and compress the distinct color sets of the genomes to 40 GiB on disk in 7 hours and 17 minutes, using only 14 GiB of RAM and no temporary disk space, with an error probability of at most 2-82.

18
TCRseek: Scalable Approximate Nearest Neighbor Search for T-Cell Receptor Repertoires via Windowed k-mer Embeddings

Yang, Y.

2026-03-24 bioinformatics 10.64898/2026.03.20.713313 medRxiv
Top 0.1%
12.2%
Show abstract

The rapid growth of T-cell receptor (TCR) sequencing data has created an urgent need for computational methods that can efficiently search CDR3 sequences at scale. Existing approaches either rely on exact pairwise distance computation, which scales quadratically with repertoire size, or employ heuristic grouping that sacrifices sensitivity. Here we present TCRseek, a two-stage retrieval framework that combines biologically informed sequence embeddings with approximate nearest neighbor (ANN) indexing for scalable search over TCR repertoires. TCRseek first encodes CDR3 amino acid sequences into fixed-length numerical vectors through a multi-scale windowed k-mer embedding scheme derived from BLOSUM62 eigendecomposition, then indexes these vectors using FAISS-based structures (IVF-Flat, IVF-PQ, or HNSW-Flat) that support sublinear-time search. A second-stage reranking module refines the shortlisted candidates using exact sequence alignment scores (Needleman-Wunsch with BLOSUM62), Levenshtein distance, or Hamming distance. We benchmarked TCRseek against tcrdist3, TCRMatch, and GIANA on a 100,000-sequence corpus with precomputed exact ground truth under three distance metrics. Under cross-metric evaluation--where the reranking and ground truth metrics differ, providing the most informative test of generalization--TCRseek achieved NDCG@10 = 0.890 (Levenshtein ground truth) and 0.880 (Hamming ground truth), ranking highest among the retained baselines under Hamming and remaining competitive with tcrdist3 (0.894) under Levenshtein. When the reranking metric matches the ground truth definition (BLOSUM62 alignment), NDCG@10 reached 0.993, confirming that the ANN shortlist captures >99% of true neighbors--the expected ceiling of the two-stage design. On the 100,000-sequence corpus, TCRseek achieved 3.6-39.6x speedup over exact brute-force search depending on index type and distance metric, with the largest gains for alignment-based retrieval. These results demonstrate that embedding-based ANN search provides a practical and scalable alternative for TCR repertoire analysis.

19
cspray: Distributed Single Cell Transcriptome Analysis

Hawkins, P. G.; Swanson, E. M.; Feichtel, M.

2026-02-09 bioinformatics 10.64898/2026.02.06.704110 medRxiv
Top 0.1%
12.1%
Show abstract

The size of individual single cell samples continues to grow with advancing technologies, as do the number of samples included in individual experiments and across organizations. This presents challenges for processing this data at scale, both in terms of computational throughput and the required size of the machines that must process this data. We present a single cell RNA processing method that is fully distributed, capable of processing arbitrarily large files, and numbers of files, without requiring per-file based compute sizing. Our method, cspray, includes data ingestion, preprocessing, highly variable gene annotation, PCA, and clustering. We also show that this processing at scale permits LLM based reference-free cluster annotation on low resolution clusters, which demonstrates these techniques can be used to build single cell data discovery platforms at scale.

20
TandemTwister: Scalable genotyping and advanced visualization of tandem repeats

Al Raei, L. W.; Ghareghani, M.; Moeinzadeh, H.; Vingron, M.

2026-01-31 genomics 10.64898/2026.01.28.702315 medRxiv
Top 0.1%
12.1%
Show abstract

Tandem repeats are genomic regions consisting of consecutively repeated units with variable copy numbers and possible mutations. They are used in DNA fingerprinting and have been implicated in complex traits and genetic disorders, including neurodegenerative and developmental diseases. The vast and expanding number of tandem repeat loci in the human genome underscores the need for fast and scalable tools for accurate genotyping and visualization. An accurate tool for characterizing these variants is essential for understanding their functional impacts and associations with phenotypes. We developed TandemTwister, a novel algorithm implemented in C++, as a highly scalable and parallelized tool for tandem repeat copy number genotyping. Additionally, we created an interactive visualization tool to facilitate quick manual inspection, displaying exact motif occurrences, counts, and population information across haplotypes. TandemTwister demonstrates high accuracy and runtime efficiency for tandem repeat genotyping across all long-read sequencing technologies and assembled genomes. We evaluated the performance of TandemTwister in Ashkenazim trio on different sequencing technologies on a set of 1.2 million annotated tandem repeat regions. TandemTwister was the fastest and most accurate genotyping tool available for tandem repeats in comparison to the state of the art tools. For PacBio Hifi data as an example, TandemTwister was run in 17 minutes on 32 CPU cores resulting in 99.4% recall, 98.0% Mendelian consistency, and 94% sequence accuracy. We also showed a successful super-population clustering and examined inheritance patterns of tandem repeats and haplotype blocks in three trio sets. TandemTwister demonstrated its ability to detect pathogenic repeat expansions. We applied it in a cohort of 31 individuals with neurodegenerative and developmental disorders, successfully distinguishing healthy from pathogenic copy numbers.